PART 1 — BIG PICTURE

In industrial desktop systems, failure handling is not a side topic. It is part of the core design.

In a normal business app, an error often means one request failed, one screen failed, or one save operation failed. In a wafer inspection machine system, an error can mean the machine is still moving while the UI thinks it stopped, the camera stopped delivering images but the workflow keeps running, or inspection results are only half saved while the operator believes the lot is complete.

That is why failure handling is a first-class concern.

Why “just catch the exception” is not enough

A lot of engineers learn error handling as:

  • wrap code in try/catch
  • log exception
  • show message
  • continue

That is nowhere near enough for a machine-control system.

Because the real question is not just:

“Did something throw?”

The real questions are:

  • What kind of failure is this?
  • Is the operation safe to retry?
  • Has the machine state changed or not?
  • Is the workflow still trustworthy?
  • Can the operator continue?
  • Do we need to stop the machine?
  • Do we need to mark results as invalid?
  • Can we recover automatically, or do we need human intervention?

That is the real problem.

Why resilience is different in desktop + hardware systems

Web and backend systems usually deal with stateless requests. If one request fails, you retry, return 500, or let another instance handle the next request.

A desktop app controlling hardware is different:

  • it is long-running
  • it often owns live device connections
  • it holds in-memory workflow state
  • it interacts with machines that have physical behavior
  • it must keep UI state, machine state, and workflow state aligned

That makes failure handling harder.

Here are some real examples.

Machine disconnection

Suppose the app is sending stage movement commands to the machine controller. The Ethernet connection drops. The app may not know whether:

  • the command never reached the machine
  • the command was accepted but the response was lost
  • the machine is still moving
  • the machine entered alarm state

This is much worse than a failed HTTP request.

Camera timeout

The camera SDK waits for a frame and times out after 2 seconds. Is this a harmless delay? Is the trigger cable disconnected? Is the camera frozen? Did exposure settings make frame acquisition too slow? The app has to treat this as an operational problem, not just a thrown exception.

Vendor SDK failure

Vendor SDKs are often not clean, modern, predictable libraries. They may:

  • throw vague exceptions
  • return integer error codes
  • deadlock
  • hang
  • leak handles
  • fail only after hours of operation
  • become invalid after reconnect

So resilience is not just about .NET exceptions. It is also about defending your system from bad external components.

File save failure during inspection

Imagine inspection images are being saved during a run, and disk space runs out. The machine may still be generating results. Now you have a dangerous partial-success situation:

  • inspection physically happened
  • some results are in memory
  • some images are on disk
  • some metadata is in database
  • some output is missing

This is exactly the kind of production problem senior engineers think about all the time.


PART 2 — HOW IT ACTUALLY WORKS

Good resilience starts with classifying failures correctly.

1. Expected errors vs unexpected failures

This distinction matters a lot.

Expected errors

These are failures the system should anticipate as part of normal operation.

Examples:

  • machine not connected
  • command timeout
  • camera frame timeout
  • invalid recipe
  • file path unavailable
  • operator tries to start while machine is not homed
  • disk full
  • network share temporarily unavailable

These are not “bugs” in the classic sense. They are operational conditions. The system should handle them deliberately.

Unexpected failures

These are failures that indicate bugs, corrupt state, bad assumptions, or broken dependencies.

Examples:

  • null reference because internal state was inconsistent
  • unexpected vendor SDK exception type
  • race condition causing duplicate workflow completion
  • collection modified from wrong thread
  • impossible state transition
  • corrupted inspection result object

These should usually be treated more aggressively. Often you log, stop the workflow, move to safe state, and preserve evidence for debugging.
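
One way to make this distinction explicit is to classify a failure before deciding how to react. The sketch below is only an illustration: `FailureKind`, `FailureClassifier`, and the choice of which exception types count as operational are assumptions, not part of any vendor SDK or framework.

```csharp
using System;
using System.IO;

// Hypothetical classification used to drive the handling decision.
public enum FailureKind
{
    ExpectedOperational, // anticipated condition: handle deliberately
    Unexpected           // bug or corrupt state: stop, go to safe state, keep evidence
}

public static class FailureClassifier
{
    // Anything not explicitly recognized is treated as unexpected,
    // which is the conservative default for a machine-control system.
    public static FailureKind Classify(Exception ex) => ex switch
    {
        TimeoutException => FailureKind.ExpectedOperational,
        IOException => FailureKind.ExpectedOperational, // includes missing directories and files
        _ => FailureKind.Unexpected
    };
}
```

A real system would classify its own domain exceptions here rather than raw framework types; the point is that the default branch is the aggressive one.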

2. Exception handling strategy

A production system should have clear exception boundaries.

Do not scatter random try/catch blocks everywhere.

Instead, think in layers:

  • hardware adapter boundary
  • workflow boundary
  • background processing boundary
  • UI command boundary
  • application top-level boundary

Each boundary has a different responsibility.

Hardware adapter boundary

Convert messy SDK behavior into clean application-specific exceptions or result objects.

Example:

  • vendor timeout code becomes CameraTimeoutException
  • disconnected controller becomes MachineDisconnectedException

This isolates the rest of the system from vendor chaos.
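
For SDKs that report failures through integer return codes, the same translation can be sketched as below. The numeric codes, `VendorErrorTranslator`, and `VendorSdkException` are invented for illustration; real values have to come from the vendor's documentation.

```csharp
using System;

// Hypothetical domain exception for a camera frame timeout.
public sealed class CameraTimeoutException : Exception
{
    public CameraTimeoutException(string message) : base(message) { }
}

// Hypothetical catch-all for vendor codes the adapter does not recognize.
public sealed class VendorSdkException : Exception
{
    public int VendorCode { get; }

    public VendorSdkException(int vendorCode)
        : base($"Vendor SDK returned unrecognized error code {vendorCode}.")
    {
        VendorCode = vendorCode;
    }
}

public static class VendorErrorTranslator
{
    // Assumed codes, for illustration only.
    private const int Ok = 0;
    private const int FrameTimeout = -3;

    public static void ThrowIfFailed(int vendorCode, string operation)
    {
        switch (vendorCode)
        {
            case Ok:
                return;
            case FrameTimeout:
                throw new CameraTimeoutException($"'{operation}' timed out waiting for a frame.");
            default:
                // Keep the raw code: it is often the only clue for vendor support.
                throw new VendorSdkException(vendorCode);
        }
    }
}
```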

Workflow boundary

Protect the inspection workflow from crashing unpredictably. If a fatal hardware error happens during inspection, the workflow should transition to Faulted or Stopping, not just die on a background thread.

UI command boundary

When an operator presses Start, Stop, or Load Recipe, you need user-safe feedback. The UI should not show raw stack traces. It should show a meaningful operational message.

Top-level boundary

Unhandled exceptions should be logged with full context and should force the app into a safe failure mode. In some cases, you may disable machine commands, show a fatal error screen, or require restart.
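
A minimal sketch of such a top-level boundary, assuming a hypothetical `MachineCommandGate` that every machine command checks before running. The gate and the `logFatal` callback are illustrative names; the two events are standard .NET.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical gate: once tripped, machine commands are refused until restart.
public sealed class MachineCommandGate
{
    private int _tripped; // 0 = commands allowed, 1 = fatal error seen

    public bool IsTripped => Volatile.Read(ref _tripped) == 1;

    public void Trip() => Interlocked.Exchange(ref _tripped, 1);

    public void EnsureCommandsAllowed()
    {
        if (IsTripped)
            throw new InvalidOperationException(
                "Machine commands are disabled after a fatal error. Restart is required.");
    }
}

public static class TopLevelExceptionBoundary
{
    public static void Install(MachineCommandGate gate, Action<Exception, string> logFatal)
    {
        // Exceptions that escaped every other boundary; the process is usually terminating.
        AppDomain.CurrentDomain.UnhandledException += (_, e) =>
        {
            gate.Trip();
            if (e.ExceptionObject is Exception ex)
                logFatal(ex, "AppDomain.UnhandledException");
        };

        // Faulted tasks whose exceptions nobody awaited or inspected.
        TaskScheduler.UnobservedTaskException += (_, e) =>
        {
            gate.Trip();
            logFatal(e.Exception, "TaskScheduler.UnobservedTaskException");
            e.SetObserved();
        };
    }
}
```

In a WPF app the same handler would typically also be attached to `Application.DispatcherUnhandledException`, which covers exceptions thrown on the UI thread.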

3. Retry, timeout, fallback, fail-fast

These words are often used loosely. In industrial systems, they need careful meaning.

Retry

Retry is appropriate only when the operation is transient and retry is safe.

Good candidates:

  • reconnect to machine status channel
  • save image to network share after temporary IO error
  • read non-critical telemetry again
  • query machine status again

Dangerous candidates:

  • send “start motion” again
  • send “dispense chemical” again
  • trigger camera again when unsure whether the first trigger succeeded
  • commit workflow completion twice

The key question is not “can it fail transiently?” The key question is “is it safe if I accidentally perform it twice?”
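
That question can be attached to each operation so a retry helper cannot be applied blindly. A small sketch, where `RetrySafety`, `MachineOperation`, and `RetryGuard` are hypothetical names:

```csharp
using System;

public enum RetrySafety
{
    SafeToRepeat,   // reads, status queries, idempotent writes
    NotSafeToRepeat // physical actions, triggers, commits
}

// Each operation declares up front whether executing it twice is harmless.
public sealed record MachineOperation(string Name, RetrySafety Safety);

public static class RetryGuard
{
    // Operations that are not safe to repeat always get exactly one attempt,
    // regardless of what the retry configuration says.
    public static int MaxAttemptsFor(MachineOperation operation, int configuredMaxAttempts) =>
        operation.Safety == RetrySafety.SafeToRepeat
            ? Math.Max(1, configuredMaxAttempts)
            : 1;
}
```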

Timeout

Timeout is critical because vendor SDKs and hardware APIs often hang.

Without timeouts:

  • background threads get stuck forever
  • UI waits indefinitely
  • shutdown hangs
  • workflows never complete or fail cleanly

But timeouts must be tuned based on real machine behavior. Too short causes false failures. Too long leaves the operator waiting and delays recovery.

Fallback

Fallback means switching to a degraded but controlled mode.

Examples:

  • use local disk if network share fails
  • continue UI updates without thumbnails if image decode fails
  • use cached machine metadata if live read fails temporarily
  • allow offline review of existing results when machine is disconnected

Fallback is useful, but only when it preserves correctness. Never fake success.
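
As a rough sketch of the first example, the save below falls back to a local spool folder and reports where the data actually went, so a degraded save is never presented as a normal one. `ImageSaver`, `SaveOutcome`, and the directory parameters are hypothetical.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public enum SaveLocation { Primary, LocalSpool }

// The result states which location was used; the caller can mark the run degraded.
public sealed record SaveOutcome(SaveLocation Location, string FullPath);

public static class ImageSaver
{
    public static async Task<SaveOutcome> SaveWithFallbackAsync(
        byte[] imageBytes,
        string fileName,
        string primaryDirectory,
        string spoolDirectory,
        CancellationToken cancellationToken = default)
    {
        try
        {
            var primaryPath = Path.Combine(primaryDirectory, fileName);
            await File.WriteAllBytesAsync(primaryPath, imageBytes, cancellationToken);
            return new SaveOutcome(SaveLocation.Primary, primaryPath);
        }
        catch (IOException)
        {
            // Primary store failed: degrade to the local spool. If this write
            // also fails, the exception propagates and the save counts as failed.
            Directory.CreateDirectory(spoolDirectory);
            var spoolPath = Path.Combine(spoolDirectory, fileName);
            await File.WriteAllBytesAsync(spoolPath, imageBytes, cancellationToken);
            return new SaveOutcome(SaveLocation.LocalSpool, spoolPath);
        }
    }
}
```

Returning the location instead of a bare success flag is the design choice that matters here: the workflow can record that the run used the fallback path.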

Fail-fast

Fail-fast means stopping immediately when continuing would be unsafe or would destroy debuggability.

Examples:

  • machine/app state divergence detected
  • impossible workflow state
  • safety interlock alarm
  • corrupt recipe
  • result stream integrity broken in a way you cannot trust

In those cases, stopping early is not weakness. It is good engineering.


PART 3 — REAL PROBLEMS IN THIS SYSTEM

Now let’s apply this to:

A WPF desktop app controlling a wafer inspection machine

Machine command timeout

Suppose the app sends MoveToPositionAsync(x, y) and waits for completion.

A timeout may mean several different things:

  • command never arrived
  • machine accepted command but response was lost
  • machine is still moving
  • controller hung
  • motion completed but status polling failed

This is why timeout handling cannot simply do:

  • throw exception
  • show popup
  • continue

Instead, after timeout, the system usually needs a recovery sequence:

  1. mark command as uncertain
  2. stop issuing new commands
  3. attempt status re-sync
  4. query motion/alarm state
  5. possibly issue safe stop
  6. transition workflow into paused/faulted state
  7. require operator acknowledgement if trust is lost

The important thing is that timeout creates uncertainty, not just delay.
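
The recovery sequence above can be sketched roughly as follows. The delegates stand in for whatever status and stop calls the real controller offers; `MotionState`, `TimeoutRecoveryResult`, and `CommandTimeoutRecovery` are hypothetical names, and the decision rules are deliberately simplified.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public enum MotionState { Idle, Moving, Alarm }

public enum TimeoutRecoveryResult
{
    MachineIdleConfirmed, // machine answered and is idle; position must still be verified
    StoppedByRecovery,    // machine was moving or in alarm and accepted a safe stop
    TrustLost             // machine state unknown; operator acknowledgement required
}

public static class CommandTimeoutRecovery
{
    public static async Task<TimeoutRecoveryResult> RecoverAsync(
        Func<CancellationToken, Task<MotionState>> readMotionState,
        Func<CancellationToken, Task> safeStop,
        Action blockNewCommands,
        CancellationToken cancellationToken = default)
    {
        // Steps 1-2: the command outcome is uncertain, so stop issuing new commands.
        blockNewCommands();

        MotionState state;
        try
        {
            // Steps 3-4: ask the machine what it is actually doing.
            state = await readMotionState(cancellationToken);
        }
        catch (Exception) when (!cancellationToken.IsCancellationRequested)
        {
            // Step 7: status cannot be read, so trust is lost.
            return TimeoutRecoveryResult.TrustLost;
        }

        if (state == MotionState.Idle)
            return TimeoutRecoveryResult.MachineIdleConfirmed;

        try
        {
            // Step 5: still moving or in alarm, so request a safe stop.
            await safeStop(cancellationToken);
        }
        catch (Exception) when (!cancellationToken.IsCancellationRequested)
        {
            return TimeoutRecoveryResult.TrustLost;
        }

        // Step 6: the caller moves the workflow to a paused or faulted state.
        return TimeoutRecoveryResult.StoppedByRecovery;
    }
}
```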

Hardware alarm during active inspection

Suppose an alarm occurs while inspecting wafer die 423 of 800.

Now you have multiple parallel concerns:

  • machine physical state
  • workflow state
  • acquired images
  • analysis pipeline
  • partial persisted results
  • UI state

Good handling usually looks like this:

  • immediately stop accepting further work into pipeline
  • cancel or drain background stages
  • mark current item as incomplete or invalid
  • snapshot machine alarm information
  • preserve partial results with explicit status
  • transition UI into alarm mode
  • force operator decision: retry item, skip item, abort lot, recover machine

This is not just exception handling. It is controlled workflow degradation.
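
The first three steps (stop accepting work, drain, mark items incomplete) might look like the sketch below, built on `System.Threading.Channels`. `InspectionPipeline`, `DieWorkItem`, and `DieOutcome` are hypothetical types, and a real pipeline would have several stages rather than one queue.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Threading.Channels;

public enum DieResultStatus { Complete, Incomplete }

public sealed record DieWorkItem(int DieIndex);
public sealed record DieOutcome(int DieIndex, DieResultStatus Status, string? Note);

public sealed class InspectionPipeline
{
    private readonly Channel<DieWorkItem> _queue = Channel.CreateUnbounded<DieWorkItem>();

    // Returns false once the pipeline has been closed by an alarm.
    public bool TryEnqueue(DieWorkItem item) => _queue.Writer.TryWrite(item);

    // Called when the machine raises an alarm.
    public IReadOnlyList<DieOutcome> CloseOnAlarm(string alarmText)
    {
        // Stop accepting further work.
        _queue.Writer.TryComplete();

        // Drain what was already queued and record it explicitly as incomplete,
        // instead of silently dropping it.
        var outcomes = new List<DieOutcome>();
        while (_queue.Reader.TryRead(out var item))
            outcomes.Add(new DieOutcome(item.DieIndex, DieResultStatus.Incomplete, alarmText));

        return outcomes;
    }
}
```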

Partial workflow completion

This is one of the hardest real-world problems.

Example:

  • 700 dies inspected successfully
  • 50 have images saved but analysis not finished
  • 10 were in-flight in memory
  • database commit for lot summary failed

What is the truth of the run?

A weak design loses trust here. A strong design explicitly models partial completion.

You need concepts like:

  • Completed
  • PartiallyCompleted
  • Faulted
  • ResultsPendingPersistence
  • RecoveryRequired

Without explicit status modeling, teams invent fragile boolean flags and end up with silent data corruption.
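
A minimal sketch of explicit status modeling, assuming a `Running` state in addition to the ones listed above. The transition table is an illustrative guess; the real rules depend on the process requirements.

```csharp
using System;
using System.Collections.Generic;

public enum RunStatus
{
    Running,
    Completed,
    PartiallyCompleted,
    Faulted,
    ResultsPendingPersistence,
    RecoveryRequired
}

public sealed class RunStateMachine
{
    // Assumed transition rules, for illustration only.
    private static readonly Dictionary<RunStatus, RunStatus[]> Allowed = new()
    {
        [RunStatus.Running] = new[] { RunStatus.ResultsPendingPersistence, RunStatus.Faulted, RunStatus.RecoveryRequired },
        [RunStatus.ResultsPendingPersistence] = new[] { RunStatus.Completed, RunStatus.PartiallyCompleted, RunStatus.Faulted },
        [RunStatus.RecoveryRequired] = new[] { RunStatus.Running, RunStatus.PartiallyCompleted, RunStatus.Faulted },
        [RunStatus.Completed] = Array.Empty<RunStatus>(),
        [RunStatus.PartiallyCompleted] = Array.Empty<RunStatus>(),
        [RunStatus.Faulted] = Array.Empty<RunStatus>()
    };

    public RunStatus Current { get; private set; } = RunStatus.Running;

    // An impossible transition is a bug, so it fails fast instead of being ignored.
    public void TransitionTo(RunStatus next)
    {
        if (Array.IndexOf(Allowed[Current], next) < 0)
            throw new InvalidOperationException($"Illegal run transition {Current} -> {next}.");

        Current = next;
    }
}
```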

Losing synchronization between app state and machine state

This is a classic industrial failure mode.

The app thinks:

  • machine is idle
  • no active run
  • stage at home

The machine is actually:

  • still in run mode
  • paused on alarm
  • stage midway through motion
  • recipe loaded from previous job

This can happen after:

  • reconnect
  • app crash and restart
  • network drop
  • SDK reinitialization
  • machine reboot

When this happens, the system must re-synchronize deliberately. It cannot just resume normal operation.

A good recovery flow often includes:

  • query machine authoritative state
  • compare with local workflow state
  • detect mismatch categories
  • decide whether auto-recovery is allowed
  • otherwise require operator-assisted reconciliation

Example: “Machine reports inspection active, but application has no active job context.” That is not a popup problem. That is a controlled recovery problem.
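
The comparison step of that flow could be sketched like this. The snapshot records, the `StateMismatch` categories, and the rule that only a clean match may resume automatically are all assumptions chosen for illustration.

```csharp
#nullable enable
using System;

// What the machine reports after a re-sync query (hypothetical shape).
public sealed record MachineSnapshot(bool RunActive, bool AlarmActive, bool StageMoving, string? LoadedRecipeId);

// What the application believes (hypothetical shape).
public sealed record LocalSnapshot(bool HasActiveJob, string? ExpectedRecipeId);

[Flags]
public enum StateMismatch
{
    None = 0,
    MachineRunningWithoutJob = 1,
    JobWithoutMachineRun = 2,
    AlarmActive = 4,
    UnexpectedMotion = 8,
    RecipeMismatch = 16
}

public static class StateReconciler
{
    public static StateMismatch Compare(MachineSnapshot machine, LocalSnapshot local)
    {
        var result = StateMismatch.None;

        if (machine.RunActive && !local.HasActiveJob) result |= StateMismatch.MachineRunningWithoutJob;
        if (!machine.RunActive && local.HasActiveJob) result |= StateMismatch.JobWithoutMachineRun;
        if (machine.AlarmActive) result |= StateMismatch.AlarmActive;
        if (machine.StageMoving && !local.HasActiveJob) result |= StateMismatch.UnexpectedMotion;
        if (local.HasActiveJob && machine.LoadedRecipeId != local.ExpectedRecipeId) result |= StateMismatch.RecipeMismatch;

        return result;
    }

    // Conservative assumption: any mismatch requires operator-assisted reconciliation.
    public static bool AutoRecoveryAllowed(StateMismatch mismatch) => mismatch == StateMismatch.None;
}
```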

Failure while streaming results or saving images

Real-time result streaming often involves multiple stages:

  • acquisition
  • preprocessing
  • analysis
  • visualization
  • persistence

If saving images fails but visualization continues, the operator might think everything is fine while forensic evidence is missing.

If analysis fails for some items but the stream keeps flowing, totals may look valid while defect data is incomplete.

That means each pipeline stage needs clear failure semantics.

Possible policies:

  • stop entire run on any persistence failure
  • continue inspection but mark run degraded
  • buffer temporarily and retry save
  • switch to emergency local spool folder
  • allow review-only mode until data integrity restored

There is no one universal answer. It depends on whether missing data is acceptable.


PART 4 — HOW WE USE IT IN .NET (PRACTICAL)

The practical .NET approach is to create clear error boundaries, domain-specific exceptions, safe timeout handling, and recovery-oriented workflow code.

1. Exception boundaries

A good pattern is:

  • low-level SDK layer translates raw failures
  • application service decides recovery action
  • UI layer shows operator-friendly message

Example: hardware adapter boundary

csharp
public sealed class MachineDisconnectedException : Exception
{
    public MachineDisconnectedException(string message, Exception? inner = null)
        : base(message, inner) { }
}

public sealed class MachineCommandTimeoutException : Exception
{
    public string CommandName { get; }
    public TimeSpan Timeout { get; }

    public MachineCommandTimeoutException(string commandName, TimeSpan timeout, Exception? inner = null)
        : base($"Machine command '{commandName}' timed out after {timeout}.", inner)
    {
        CommandName = commandName;
        Timeout = timeout;
    }
}

public interface IMachineController
{
    Task MoveToAsync(double x, double y, CancellationToken cancellationToken);
    Task StopAsync(CancellationToken cancellationToken);
    Task<MachineStatus> GetStatusAsync(CancellationToken cancellationToken);
}

csharp
public sealed class VendorMachineController : IMachineController
{
    // Assumed to match the timeout the vendor SDK applies to a move; used only for reporting.
    private static readonly TimeSpan MoveTimeout = TimeSpan.FromSeconds(5);

    private readonly IVendorSdk _sdk;

    public VendorMachineController(IVendorSdk sdk)
    {
        _sdk = sdk;
    }

    public async Task MoveToAsync(double x, double y, CancellationToken cancellationToken)
    {
        try
        {
            var success = await _sdk.MoveStageAsync(x, y, cancellationToken);

            if (!success)
            {
                throw new InvalidOperationException("Vendor SDK reported move failure.");
            }
        }
        catch (TimeoutException ex)
        {
            // Keep the original exception as InnerException so vendor detail is not lost.
            throw new MachineCommandTimeoutException("MoveTo", MoveTimeout, ex);
        }
        catch (SocketException ex)
        {
            throw new MachineDisconnectedException("Machine connection lost during move command.", ex);
        }
        catch (VendorSdkDisconnectedException ex)
        {
            throw new MachineDisconnectedException("Vendor SDK reports machine disconnected.", ex);
        }
    }

    public Task StopAsync(CancellationToken cancellationToken)
        => _sdk.StopAsync(cancellationToken);

    public Task<MachineStatus> GetStatusAsync(CancellationToken cancellationToken)
        => _sdk.ReadStatusAsync(cancellationToken);
}

The important part is not the syntax. The important part is isolating vendor weirdness from the rest of the application.


2. Timeout patterns

Timeouts should not be hidden all over the place. They should be explicit.

csharp
public static class TaskTimeoutExtensions
{
    public static async Task<T> WithTimeout<T>(
        this Task<T> task,
        TimeSpan timeout,
        string operationName,
        CancellationToken cancellationToken = default)
    {
        using var timeoutCts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
        var delayTask = Task.Delay(timeout, timeoutCts.Token);

        var completed = await Task.WhenAny(task, delayTask);

        if (completed == delayTask)
        {
            // The delay also ends early when the caller cancels. Report that as
            // cancellation, not as a timeout.
            cancellationToken.ThrowIfCancellationRequested();
            throw new TimeoutException($"Operation '{operationName}' timed out after {timeout}.");
        }

        timeoutCts.Cancel(); // stop the pending delay timer
        return await task;
    }

    public static async Task WithTimeout(
        this Task task,
        TimeSpan timeout,
        string operationName,
        CancellationToken cancellationToken = default)
    {
        using var timeoutCts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
        var delayTask = Task.Delay(timeout, timeoutCts.Token);

        var completed = await Task.WhenAny(task, delayTask);

        if (completed == delayTask)
        {
            // Caller cancellation must surface as OperationCanceledException.
            cancellationToken.ThrowIfCancellationRequested();
            throw new TimeoutException($"Operation '{operationName}' timed out after {timeout}.");
        }

        timeoutCts.Cancel(); // stop the pending delay timer
        await task;
    }
}

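Use it carefully. WithTimeout throws a plain TimeoutException, so apply it inside the hardware adapter, where the existing catch block translates it into MachineCommandTimeoutException. If it wraps the call from the workflow instead, the timeout bypasses that translation and lands in the generic catch (Exception) handler.

csharp
// Inside VendorMachineController.MoveToAsync
var success = await _sdk.MoveStageAsync(x, y, cancellationToken)
    .WithTimeout(TimeSpan.FromSeconds(5), "Move stage", cancellationToken);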

But here is the senior-engineer warning:

A timeout does not guarantee the underlying operation stopped.

This is especially true with vendor SDKs. Your app may stop waiting, but the machine may still be moving. So timeout must usually trigger reconciliation logic afterward.


3. Safe retry patterns

Blind retry is dangerous. Wrap only safe operations.

csharp
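public static class RetryHelper
{
    public static async Task RetryTransientAsync(
        Func<Task> action,
        int maxAttempts,
        TimeSpan delay,
        Func<Exception, bool> shouldRetry,
        Action<int, Exception>? onRetry = null,
        CancellationToken cancellationToken = default)
    {
        if (maxAttempts < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(maxAttempts), "At least one attempt is required.");
        }

        for (var attempt = 1; ; attempt++)
        {
            try
            {
                await action();
                return;
            }
            // On the final attempt the filter is false, so the exception propagates to the caller.
            catch (Exception ex) when (attempt < maxAttempts && shouldRetry(ex))
            {
                onRetry?.Invoke(attempt, ex);
                await Task.Delay(delay, cancellationToken);
            }
        }
    }
}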

Safe usage example for file save:

csharp
await RetryHelper.RetryTransientAsync(
    action: () => _imageStore.SaveAsync(image, path, cancellationToken),
    maxAttempts: 3,
    delay: TimeSpan.FromMilliseconds(500),
    shouldRetry: ex => ex is IOException || ex is UnauthorizedAccessException,
    onRetry: (attempt, ex) =>
    {
        _logger.LogWarning(ex,
            "Retrying image save. Attempt {Attempt}. Path={Path}, InspectionId={InspectionId}",
            attempt, path, inspectionId);
    });

But do not do this for uncertain machine commands like StartInspectionAsync() unless the command is designed to be idempotent or has a command ID that lets the machine reject duplicates.
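
To illustrate what rejecting duplicates by command ID means, the sketch below simulates the receiving side. It is an assumption that the real controller or protocol supports something equivalent; `DeduplicatingCommandReceiver` is a stand-in, not a vendor API.

```csharp
using System;
using System.Collections.Generic;

// Simulates the machine side: a command whose ID was already seen is
// acknowledged but not executed a second time.
public sealed class DeduplicatingCommandReceiver
{
    private readonly HashSet<Guid> _seen = new();
    private readonly object _lock = new();

    public int ExecutionCount { get; private set; }

    // Returns true if the command ran now, false if it was a duplicate.
    public bool Receive(Guid commandId, Action execute)
    {
        lock (_lock)
        {
            if (!_seen.Add(commandId))
                return false;

            ExecutionCount++;
        }

        execute();
        return true;
    }
}
```

With this in place, the sender generates one ID per logical command and reuses it on every retry, so a lost response followed by a resend cannot start the action twice.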


4. Recovery flows after machine or workflow failure

The most important code in real systems is not “happy path” code. It is recovery code.

csharp
public sealed class InspectionWorkflowService
{
    private readonly IMachineController _machineController;
    private readonly IInspectionStateStore _stateStore;
    private readonly ILogger<InspectionWorkflowService> _logger;

    public InspectionWorkflowService(
        IMachineController machineController,
        IInspectionStateStore stateStore,
        ILogger<InspectionWorkflowService> logger)
    {
        _machineController = machineController;
        _stateStore = stateStore;
        _logger = logger;
    }

    public async Task RunInspectionAsync(InspectionJob job, CancellationToken cancellationToken)
    {
        try
        {
            await _stateStore.MarkRunningAsync(job.Id, cancellationToken);

            foreach (var die in job.Dies)
            {
                cancellationToken.ThrowIfCancellationRequested();

                await InspectDieAsync(job.Id, die, cancellationToken);
            }

            await _stateStore.MarkCompletedAsync(job.Id, cancellationToken);
        }
        catch (MachineCommandTimeoutException ex)
        {
            _logger.LogError(ex,
                "Machine command timeout during inspection. JobId={JobId}",
                job.Id);

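            // Use CancellationToken.None: cleanup must still run if the caller's token is already canceled.
            await TryEnterSafeStopAsync(job.Id, "Machine command timeout", CancellationToken.None);
            await _stateStore.MarkFaultedAsync(job.Id, "Machine timeout", CancellationToken.None);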
            throw;
        }
        catch (MachineDisconnectedException ex)
        {
            _logger.LogError(ex,
                "Machine disconnected during inspection. JobId={JobId}",
                job.Id);

            // Recovery state must be recorded even if the caller's token is already canceled.
            await _stateStore.MarkRecoveryRequiredAsync(job.Id, "Machine disconnected", CancellationToken.None);
            throw;
        }
        catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)
        {
            _logger.LogInformation(
                "Inspection canceled by request. JobId={JobId}",
                job.Id);

            await TryEnterSafeStopAsync(job.Id, "Canceled", CancellationToken.None);
            await _stateStore.MarkCanceledAsync(job.Id, CancellationToken.None);
            throw;
        }
        catch (Exception ex)
        {
            _logger.LogCritical(ex,
                "Unexpected fatal inspection failure. JobId={JobId}",
                job.Id);

            await TryEnterSafeStopAsync(job.Id, "Unexpected failure", CancellationToken.None);
            await _stateStore.MarkFaultedAsync(job.Id, "Unexpected fatal error", CancellationToken.None);
            throw;
        }
    }

    private Task InspectDieAsync(string jobId, DiePosition die, CancellationToken cancellationToken)
    {
        // Placeholder: move, acquire, analyze, persist.
        return Task.CompletedTask;
    }

    private async Task TryEnterSafeStopAsync(string jobId, string reason, CancellationToken cancellationToken)
    {
        try
        {
            _logger.LogWarning(
                "Attempting safe stop. JobId={JobId}, Reason={Reason}",
                jobId, reason);

            await _machineController.StopAsync(cancellationToken);
        }
        catch (Exception ex)
        {
            _logger.LogCritical(ex,
                "Safe stop failed. JobId={JobId}, Reason={Reason}",
                jobId, reason);
        }
    }
}

Notice the mindset:

  • catch specific operational failures first
  • transition workflow state explicitly
  • attempt safe stop
  • never assume stop succeeded
  • preserve logs and recovery state

That is production thinking.


5. Logging with enough context

A log line that says:

Error during inspection

is almost useless.

In real systems, you need context rich enough to reconstruct the failure.

Good context often includes:

  • job/lot ID
  • wafer ID
  • die position
  • recipe version
  • machine state
  • command name
  • timeout value
  • camera ID
  • image path
  • thread or pipeline stage
  • correlation or operation ID
  • operator action that triggered it

Example:

csharp
_logger.LogError(ex,
    "Image save failed. JobId={JobId}, WaferId={WaferId}, DieX={DieX}, DieY={DieY}, CameraId={CameraId}, Path={Path}, Recipe={Recipe}",
    job.Id,
    wafer.Id,
    die.X,
    die.Y,
    camera.Id,
    imagePath,
    recipe.Version);

That is the difference between “we saw an error” and “we can actually debug it tomorrow.”


6. Operator-safe UI messaging

Never dump raw exception text to machine operators.

Bad:

  • “NullReferenceException at ImagePipeline.cs line 84”
  • “SocketException 10054”
  • giant stack trace popup

Better:

  • “Camera image acquisition timed out. Inspection has been paused.”
  • “Machine connection was lost. Reconnect and re-synchronize before continuing.”
  • “Inspection images could not be saved. Current run was stopped to protect data integrity.”

Keep technical detail in logs, not in operator messages.

A simple mapping pattern works well:

csharp
public sealed class OperatorMessageService
{
    public string ToOperatorMessage(Exception ex) => ex switch
    {
        MachineDisconnectedException =>
            "Machine connection was lost. Please reconnect and verify machine state.",

        MachineCommandTimeoutException =>
            "A machine command timed out. The system is verifying machine state before continuing.",

        IOException =>
            "Failed to save inspection data. Please verify storage availability.",

        _ =>
            "An unexpected system error occurred. The operation was stopped. Verify machine state before continuing."
    };
}

PART 5 — COMMON MISTAKES (VERY REALISTIC)

1. catch (Exception) everywhere

This usually starts from good intentions. Teams want the app to “never crash.”

So they wrap everything.

Result:

  • real bugs get hidden
  • state corruption continues
  • workflows limp forward in invalid state
  • logs become noisy and useless
  • root causes become impossible to identify

Production consequence: the app looks stable on the surface, but operators start seeing strange behavior, missing results, frozen workflows, and random recovery issues.

2. Swallowing errors

This is one of the worst industrial mistakes.

Example:

csharp
try
{
    await SaveImageAsync(...);
}
catch
{
    // ignore
}

That is not resilience. That is data loss.

Production consequence:

  • missing images
  • inconsistent reports
  • invalid defect evidence
  • debugging nightmare because failure happened silently

3. Retrying dangerous operations blindly

Teams often build a “generic retry helper” and apply it to everything.

That is a trap.

Retrying a read is different from retrying a physical command.

Production consequence:

  • duplicate machine actions
  • inconsistent stage position
  • repeated device triggers
  • duplicate workflow commits
  • safety risk in real hardware scenarios

4. Showing technical exception text directly to operators

Operators need actionable operational guidance, not developer detail.

Production consequence:

  • confusion
  • wrong recovery action
  • unnecessary panic
  • support tickets with screenshots of meaningless stack traces

5. Not restoring system to safe state

Some systems detect failure correctly but do not perform safe shutdown or safe pause.

Example:

  • UI shows “inspection failed”
  • but acquisition thread still running
  • machine still executing
  • pipeline still buffering results

Production consequence:

  • system drift
  • app/machine desynchronization
  • more damage after the initial error than from the original error itself

PART 6 — PERFORMANCE & TRADE-OFFS

Retry cost

Retries are not free.

They add:

  • latency
  • duplicate load
  • queue buildup
  • slower recovery
  • operator waiting time

In real-time systems, aggressive retries can make the whole system feel hung. Sometimes one fast failure is better than three slow retries.

Timeout tuning trade-offs

Timeouts are always a balancing act.

If timeout is too short:

  • you get false alarms
  • you interrupt valid slow operations
  • operators lose trust in the system

If timeout is too long:

  • recovery is delayed
  • UI appears frozen
  • cancellation feels broken
  • stuck SDK calls occupy resources too long

Good timeout values usually come from real machine measurements, not guesswork.

You often want different timeout classes:

  • UI responsiveness timeout
  • machine command timeout
  • reconnect timeout
  • persistence timeout
  • shutdown timeout

Not everything should use “30 seconds.”
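
One way to keep those classes separate is a single settings type with one named value per class. The sketch below is illustrative: `TimeoutSettings` is a hypothetical name and every default is a placeholder, not a recommendation.

```csharp
using System;

// Placeholder values only; real values should come from measurements on the actual machine.
public sealed record TimeoutSettings
{
    public TimeSpan UiResponse { get; init; } = TimeSpan.FromMilliseconds(200);
    public TimeSpan MachineCommand { get; init; } = TimeSpan.FromSeconds(5);
    public TimeSpan Reconnect { get; init; } = TimeSpan.FromSeconds(15);
    public TimeSpan Persistence { get; init; } = TimeSpan.FromSeconds(10);
    public TimeSpan Shutdown { get; init; } = TimeSpan.FromSeconds(30);

    // Rejects zero or negative values at startup instead of at the first failure.
    public void Validate()
    {
        foreach (var (name, value) in new[]
        {
            (nameof(UiResponse), UiResponse),
            (nameof(MachineCommand), MachineCommand),
            (nameof(Reconnect), Reconnect),
            (nameof(Persistence), Persistence),
            (nameof(Shutdown), Shutdown)
        })
        {
            if (value <= TimeSpan.Zero)
                throw new ArgumentOutOfRangeException(name, value, "Timeout must be positive.");
        }
    }
}
```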

Fail-fast vs aggressive recovery

Fail-fast is better when:

  • correctness matters more than uptime
  • machine state is uncertain
  • duplicate execution is dangerous
  • results cannot be trusted

Aggressive recovery is better when:

  • the operation is read-only or idempotent
  • the failure is clearly transient
  • you can preserve correctness while retrying
  • operator disruption is costly

Senior engineers do not ask, “Should we always retry or always fail fast?” They ask, “What is the cost of being wrong in each direction?”


PART 7 — SENIOR ENGINEER THINKING

1. Experienced engineers classify failures

Strong engineers do not treat all errors equally.

They classify by questions like:

  • Is it transient or persistent?
  • Is it expected or unexpected?
  • Is it safe to retry?
  • Is physical state uncertain?
  • Is data integrity at risk?
  • Is operator intervention required?
  • Can we recover automatically?
  • Can we still trust the workflow state?

That classification drives design.

2. Design for recovery, not just detection

Junior systems focus on detecting error.

Senior systems focus on what happens next.

That means designing:

  • safe stop flows
  • re-sync flows
  • reconnect flows
  • partial result marking
  • degraded mode behavior
  • operator acknowledgement steps
  • restart and resume rules

A system that detects failures but cannot recover cleanly is not really resilient.

3. Balance robustness vs complexity

You can over-engineer resilience.

If every component has:

  • retries
  • fallback modes
  • buffering
  • local spooling
  • reconnect loops
  • recovery orchestration
  • circuit breakers
  • custom state reconciliation

then the system becomes very hard to reason about.

The goal is not “maximum cleverness.” The goal is “predictable behavior under failure.”

Simple, explicit, boring recovery flows are usually better than magical automation.

4. Preserve debuggability under failure

This is a very senior principle.

Many systems become least observable exactly when failure happens.

To avoid that, design failure handling to preserve evidence:

  • structured logs
  • operation IDs
  • machine state snapshot on fault
  • current recipe and wafer context
  • workflow step at failure
  • persistence status of partial outputs
  • vendor error codes
  • timestamps around timeout and retry

Under failure, you need more signal, not less.
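
A small sketch of capturing that evidence as one structured record at the moment of the fault. `FaultSnapshot` and its fields are hypothetical and would be adapted to whatever context the real workflow has available.

```csharp
using System;
using System.Text.Json;

// One record per fault, written next to the logs so the failure can be reconstructed later.
public sealed record FaultSnapshot(
    Guid OperationId,
    DateTimeOffset OccurredAtUtc,
    string JobId,
    string WaferId,
    string RecipeVersion,
    string WorkflowStep,
    string MachineState,
    int? VendorErrorCode,
    string ExceptionType,
    string ExceptionMessage);

public static class FaultSnapshotWriter
{
    private static readonly JsonSerializerOptions Options = new() { WriteIndented = true };

    public static string ToJson(FaultSnapshot snapshot) =>
        JsonSerializer.Serialize(snapshot, Options);

    public static FaultSnapshot FromException(
        Exception ex, Guid operationId, string jobId, string waferId,
        string recipeVersion, string workflowStep, string machineState, int? vendorErrorCode) =>
        new(operationId, DateTimeOffset.UtcNow, jobId, waferId, recipeVersion,
            workflowStep, machineState, vendorErrorCode, ex.GetType().FullName ?? ex.GetType().Name, ex.Message);
}
```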

Final takeaway

In industrial desktop systems, error handling is really about trust.

Can the operator trust what the UI says? Can the engineer trust the saved results? Can the support team reconstruct what happened? Can the machine be brought back to a safe state? Can the workflow recover without hidden corruption?

That is why resilience is not “add try/catch.”

It is:

  • classify failures correctly
  • isolate bad dependencies
  • enforce explicit state transitions
  • stop safely when trust is lost
  • retry only when safe
  • preserve evidence for debugging
  • design recovery as part of the system, not as an afterthought
